Abstract

We propose a technique for learning representations of parser states in transition-based dependency parsers. Our primary innovation is a new control structure for sequence-to-sequence neural networks: the stack LSTM. Like the conventional stack data structures used in transition-based parsing, elements can be pushed to or popped from the top of the stack in constant time, but, in addition, an LSTM maintains a continuous space embedding of the stack contents. This lets us formulate an efficient parsing model that captures three facets of a parser’s state: (i) unbounded look-ahead into the buffer of incoming words, (ii) the complete history of actions taken by the parser, and (iii) the complete contents of the stack of partially built tree fragments, including their internal structures. Standard backpropagation techniques are used for training and yield state-of-the-art parsing performance.

1 Introduction 


Transition-based dependency parsing formalizes the parsing problem as a series of decisions that read words sequentially from a buffer and combine them incrementally into syntactic structures (Yamada and Matsumoto, 2003; Nivre, 2003; Nivre, 2004). This formalization is attractive since the number of operations required to build any projective parse tree is linear in the length of the sentence, making transition-based parsing computationally efficient relative to graph- and grammar-based formalisms. The challenge in transition-based parsing is modeling which action should be taken in each of the unboundedly many states encountered as the parser progresses.

This challenge has been addressed by development of alternative transition sets that simplify the modeling problem by making better attachment decisions (Nivre, 2007; Nivre, 2008; Nivre, 2009; Choi and McCallum, 2013; Bohnet and Nivre, 2012), through feature engineering (Zhang and Nivre, 2011; Ballesteros and Nivre, 2014; Chen et al., 2014; Ballesteros and Bohnet, 2014) and more recently using neural networks (Chen and Manning, 2014; Stenetorp, 2013).


We extend this last line of work by learning representations of the parser state that are sensitive to the complete contents of the parser’s state: that is, the complete input buffer, the complete history of parser actions, and the complete contents of the stack of partially constructed syntactic structures. This “global” sensitivity to the state contrasts with previous work in transition-based dependency parsing that uses only a narrow view of the parsing state when constructing representations (e.g., just the next few incoming words, the head words of the top few positions in the stack, etc.). Although our parser integrates large amounts of information, the representation used for prediction at each time step is constructed incrementally, and therefore parsing and training time remain linear in the length of the input sentence. The technical innovation that lets us do this is a variation of recurrent neural networks with long short-term memory units (LSTMs) which we call stack LSTMs (§2), and which support both reading (pushing) and “forgetting” (popping) inputs.


Our parsing model uses three stack LSTMs: one representing the input, one representing the stack of partial syntactic trees, and one representing the history of parse actions to encode parser states (§3). Since the stack of partial syntactic trees may contain both individual tokens and partial syntactic structures, representations of individual tree fragments are computed compositionally with recursive (i.e., similar to Socher et al., 2014) neural networks. The parameters are learned with backpropagation (§4), and we obtain state-of-the-art results on Chinese and English dependency parsing tasks (§5).













































2 Stack LSTMs 


In this section we provide a brief review of LSTMs (§2.1) and then define stack LSTMs (§2.2).


Notation. We follow the convention that vectors are written with lowercase, boldface letters (e.g., $\mathbf{v}$ or $\mathbf{v}_w$); matrices are written with uppercase, boldface letters (e.g., $\mathbf{M}$, $\mathbf{M}_a$, or $\mathbf{M}_{ab}$), and scalars are written as lowercase letters (e.g., $s$ or $q_z$). Structured objects such as sequences of discrete symbols are written with lowercase, bold, italic letters (e.g., $\boldsymbol{w}$ refers to a sequence of input words). Discussion of dimensionality is deferred to the experiments section below (§5).


2.1 Long Short-Term Memories 

LSTMs are a variant of recurrent neural networks (RNNs) designed to cope with the vanishing gradient problem inherent in RNNs (Hochreiter and Schmidhuber, 1997; Graves, 2013). RNNs read a vector $\mathbf{x}_t$ at each time step and compute a new (hidden) state $\mathbf{h}_t$ by applying a linear map to the concatenation of the previous time step’s state $\mathbf{h}_{t-1}$ and the input, and passing this through a logistic sigmoid nonlinearity. Although RNNs can, in principle, model long-range dependencies, training them is difficult in practice since the repeated application of a squashing nonlinearity at each step results in an exponential decay in the error signal through time. LSTMs address this with an extra memory “cell” ($\mathbf{c}_t$) that is constructed as a linear combination of the previous state and signal from the input.

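LSTM cells process inputs with three multiplicative gates which control what proportion of the current input to pass into the memory cell ($\mathbf{i}_t$) and what proportion of the previous memory cell to “forget” ($\mathbf{f}_t$). The updated value of the memory cell after an input $\mathbf{x}_t$ is computed as follows:

$$
\begin{aligned}
\mathbf{i}_t &= \sigma(\mathbf{W}_{ix}\mathbf{x}_t + \mathbf{W}_{ih}\mathbf{h}_{t-1} + \mathbf{W}_{ic}\mathbf{c}_{t-1} + \mathbf{b}_i) \\
\mathbf{f}_t &= \sigma(\mathbf{W}_{fx}\mathbf{x}_t + \mathbf{W}_{fh}\mathbf{h}_{t-1} + \mathbf{W}_{fc}\mathbf{c}_{t-1} + \mathbf{b}_f) \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(\mathbf{W}_{cx}\mathbf{x}_t + \mathbf{W}_{ch}\mathbf{h}_{t-1} + \mathbf{b}_c),
\end{aligned}
$$

where $\sigma$ is the component-wise logistic sigmoid function, and $\odot$ is the component-wise (Hadamard) product.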

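The value $\mathbf{h}_t$ of the LSTM at each time step is controlled by a third gate ($\mathbf{o}_t$) that is applied to the result of the application of a nonlinearity to the memory cell contents:

$$
\begin{aligned}
\mathbf{o}_t &= \sigma(\mathbf{W}_{ox}\mathbf{x}_t + \mathbf{W}_{oh}\mathbf{h}_{t-1} + \mathbf{W}_{oc}\mathbf{c}_t + \mathbf{b}_o) \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{aligned}
$$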
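As a concrete illustration of the LSTM update described in this section, the following is a minimal pure-Python sketch of a single time step. The names `lstm_step` and `zero_params`, and the dictionary layout of the parameters, are illustrative assumptions rather than the authors' implementation; vectors are plain lists, and the dependence of the output gate on the updated cell is likewise an assumption of this sketch.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vadd(*vectors):
    """Component-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def lstm_step(p, x, h_prev, c_prev):
    """One LSTM update; p maps names such as 'Wix' or 'bi' to parameters."""
    i = sigmoid(vadd(matvec(p["Wix"], x), matvec(p["Wih"], h_prev),
                     matvec(p["Wic"], c_prev), p["bi"]))
    f = sigmoid(vadd(matvec(p["Wfx"], x), matvec(p["Wfh"], h_prev),
                     matvec(p["Wfc"], c_prev), p["bf"]))
    g = [math.tanh(a) for a in
         vadd(matvec(p["Wcx"], x), matvec(p["Wch"], h_prev), p["bc"])]
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Assumption of this sketch: the output gate sees the updated cell c.
    o = sigmoid(vadd(matvec(p["Wox"], x), matvec(p["Woh"], h_prev),
                     matvec(p["Woc"], c), p["bo"]))
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def zero_params(n_in, n_hid):
    """Hypothetical helper: all-zero parameters, handy for sanity checks."""
    p = {}
    for gate in "ifco":
        p["W%sx" % gate] = [[0.0] * n_in for _ in range(n_hid)]
        p["W%sh" % gate] = [[0.0] * n_hid for _ in range(n_hid)]
        p["W%sc" % gate] = [[0.0] * n_hid for _ in range(n_hid)]
        p["b%s" % gate] = [0.0] * n_hid
    return p
```

With all-zero parameters every gate evaluates to 0.5 and the candidate update is 0, so `lstm_step(zero_params(1, 1), [1.0], [0.0], [2.0])` halves the previous cell value.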


To improve the representational capacity of LSTMs (and RNNs generally), LSTMs can be stacked in “layers” (Pascanu et al., 2014). In these architectures, the input to the LSTM at higher layers at time $t$ is the value of $\mathbf{h}_t$ computed by the lower layer (and $\mathbf{x}_t$ is the input at the lowest layer).

Finally, output is produced at each time step from the $\mathbf{h}_t$ value at the top layer:

$$\mathbf{y}_t = g(\mathbf{h}_t),$$

where $g$ is an arbitrary differentiable function.


2.2 Stack Long Short-Term Memories 

Conventional LSTMs model sequences in a left-to-right order.^1 Our innovation here is to augment the LSTM with a “stack pointer.” Like a conventional LSTM, new inputs are always added in the right-most position, but in stack LSTMs, the current location of the stack pointer determines which cell in the LSTM provides $\mathbf{c}_{t-1}$ and $\mathbf{h}_{t-1}$ when computing the new memory cell contents.

In addition to adding elements to the end of the sequence, the stack LSTM provides a pop operation which moves the stack pointer to the previous element (i.e., the previous element that was extended, not necessarily the right-most element). Thus, the LSTM can be understood as a stack implemented so that contents are never overwritten, that is, push always adds a new entry at the end of the list that contains a back-pointer to the previous top, and pop only updates the stack pointer.^2 This control structure is schematized in Figure 1.

By querying the output vector to which the stack pointer points (i.e., the $\mathbf{h}_{\mathrm{TOP}}$), a continuous-space “summary” of the contents of the current stack configuration is available. We refer to this value as the “stack summary.”
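The push, pop, and summary operations can be sketched as a small data structure. This is a minimal sketch under stated assumptions: the class name `StackLSTM`, its method names, and the `toy_step` stand-in (which is not an LSTM and merely records its inputs so that the pointer logic is easy to follow) are all hypothetical. Any function mapping `(x, h_prev, c_prev)` to `(h, c)` could be supplied as `step`.

```python
class StackLSTM:
    """Append-only list of recurrent states plus a movable stack pointer."""

    def __init__(self, step, h0, c0):
        self.step = step  # callable: (x, h_prev, c_prev) -> (h, c)
        # Each entry stores (h, c, index of the entry that was TOP before the push).
        self.entries = [(h0, c0, -1)]  # entry 0 stands for the empty stack
        self.top = 0

    def push(self, x):
        h_prev, c_prev, _ = self.entries[self.top]
        h, c = self.step(x, h_prev, c_prev)
        self.entries.append((h, c, self.top))  # old states are never overwritten
        self.top = len(self.entries) - 1

    def pop(self):
        if self.top == 0:
            raise IndexError("pop from an empty stack LSTM")
        self.top = self.entries[self.top][2]  # follow the back-pointer

    def summary(self):
        """The output vector at the stack pointer, i.e. the stack summary."""
        return self.entries[self.top][0]

def toy_step(x, h_prev, c_prev):
    """Stand-in for a real LSTM step: h simply records the inputs seen so far."""
    return h_prev + (x,), c_prev

stack = StackLSTM(toy_step, h0=(), c0=None)
stack.push("a")
stack.push("b")
stack.pop()
stack.push("c")
print(stack.summary())  # ('a', 'c')
```

After the pop, the push of "c" extends the state computed for "a" rather than the right-most entry (the one for "b"), which is the behavior the back-pointers exist to provide.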


What does the stack summary look like? Intuitively, elements near the top of the stack will influence the representation of the stack. However, the LSTM has the flexibility to learn to extract information from arbitrary points in the stack (Hochreiter and Schmidhuber, 1997).

Figure 1: A stack LSTM extends a conventional left-to-right LSTM with the addition of a stack pointer (notated as TOP in the figure). This figure shows three configurations: a stack with a single element (left), the result of a pop operation to this (middle), and then the result of applying a push operation (right). The boxes in the lowest rows represent stack contents, which are the inputs to the LSTM, the upper rows are the outputs of the LSTM (in this paper, only the output pointed to by TOP is ever accessed), and the middle rows are the memory cells (the $\mathbf{c}_t$’s and $\mathbf{h}_t$’s) and gates. Arrows represent function applications (usually affine transformations followed by a nonlinearity); refer to §2.1 for specifics.

^1 Ours is not the first deviation from a strict left-to-right order: previous variations include bidirectional LSTMs (Graves and Schmidhuber, 2005) and multidimensional LSTMs (Graves et al., 2007).

^2 Goldberg et al. (2013) propose a similar stack construction to prevent stack operations from invalidating existing references to the stack in a beam-search parser that must (efficiently) maintain a priority queue of stacks.


Although this architecture is to the best of our knowledge novel, it is reminiscent of the Recurrent Neural Network Pushdown Automaton (NNPDA) of Das et al. (1992), which added an external stack memory to an RNN. However, our architecture provides an embedding of the complete contents of the stack, whereas theirs made only the top of the stack visible to the RNN.


3 Dependency Parser 

We now turn to the problem of learning representations of dependency parsers. We preserve the standard data structures of a transition-based dependency parser, namely a buffer of words (B) to be processed and a stack (S) of partially constructed syntactic elements. Each stack element is augmented with a continuous-space vector embedding representing a word and, in the case of S, any of its syntactic dependents. Additionally, we introduce a third stack (A) to represent the history of actions taken by the parser.^3 Each of these stacks is associated with a stack LSTM that provides an encoding of their current contents. The full architecture is illustrated in Figure 2, and we will review each of the components in turn.


3.1 Parser Operation 


The dependency parser is initialized by pushing the words and their representations (we discuss word representations below in §3.3) of the input sentence in reverse order onto B such that the first word is at the top of B and the ROOT symbol is at the bottom, and S and A each contain an empty-stack token. At each time step, the parser computes a composite representation of the stack states (as determined by the current configurations of B, S, and A) and uses that to predict an action to take, which updates the stacks. Processing completes when B is empty (except for the empty-stack symbol), S contains two elements, one representing the full parse tree headed by the ROOT symbol and the other the empty-stack symbol, and A is the history of operations taken by the parser.


The parser state representation at time $t$, which we write $\mathbf{p}_t$ and which is used to determine the transition to take, is defined as follows:

$$\mathbf{p}_t = \max\{\mathbf{0}, \mathbf{W}[\mathbf{s}_t; \mathbf{b}_t; \mathbf{a}_t] + \mathbf{d}\},$$

where $\mathbf{W}$ is a learned parameter matrix, $\mathbf{b}_t$ is the stack LSTM encoding of the input buffer B, $\mathbf{s}_t$ is the stack LSTM encoding of S, $\mathbf{a}_t$ is the stack LSTM encoding of A, and $\mathbf{d}$ is a bias term; the result is passed through a component-wise rectified linear unit (ReLU) nonlinearity (Glorot et al., 2011).^4

^3 The A stack is only ever pushed to; our use of a stack here is purely for implementational and expository convenience.

^4 In preliminary experiments, we tried several nonlinearities and found ReLU to work slightly better than the others.

Figure 2: Parser state computation encountered while parsing the sentence “an overhasty decision was made.” Here S designates the stack of partially constructed dependency subtrees and its LSTM encoding; B is the buffer of words remaining to be processed and its LSTM encoding; and A is the stack representing the history of actions taken by the parser. These are linearly transformed, passed through a ReLU nonlinearity to produce the parser state embedding $\mathbf{p}_t$. An affine transformation of this embedding is passed to a softmax layer to give a distribution over parsing decisions that can be taken.

Finally, the parser state is used to compute the probability of the parser action at time $t$ as:

$$p(z_t \mid \mathbf{p}_t) = \frac{\exp(\mathbf{g}_{z_t}^{\top}\mathbf{p}_t + q_{z_t})}{\sum_{z' \in \mathcal{A}(S,B)} \exp(\mathbf{g}_{z'}^{\top}\mathbf{p}_t + q_{z'})},$$

where $\mathbf{g}_z$ is a column vector representing the (output) embedding of the parser action $z$, and $q_z$ is a bias term for action $z$. The set $\mathcal{A}(S,B)$ represents the valid actions that may be taken given the current contents of the stack and buffer.^5 Since $\mathbf{p}_t = f(\mathbf{s}_t, \mathbf{b}_t, \mathbf{a}_t)$ encodes information about all previous decisions made by the parser, the chain rule may be invoked to write the probability of any valid sequence of parse actions $\boldsymbol{z}$ conditional on the input as:

$$p(\boldsymbol{z} \mid \boldsymbol{w}) = \prod_{t=1}^{|\boldsymbol{z}|} p(z_t \mid \mathbf{p}_t). \qquad (1)$$
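The parser state and the action distribution restricted to valid actions can be sketched in a few lines of pure Python. The function names, the use of dictionaries keyed by action name for the output embeddings and biases, and the max-subtraction trick for numerical stability are assumptions of this sketch, not details taken from the paper.

```python
import math

def parser_state(W, d, s, b, a):
    """p_t = max{0, W[s_t; b_t; a_t] + d}, with plain lists as vectors."""
    x = s + b + a  # list concatenation plays the role of [s; b; a]
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + di)
            for row, di in zip(W, d)]

def action_distribution(p, G, q, valid_actions):
    """Softmax over the valid actions only; G[z] is g_z and q[z] is q_z."""
    scores = {z: sum(g * pi for g, pi in zip(G[z], p)) + q[z]
              for z in valid_actions}
    shift = max(scores.values())  # subtract the max for numerical stability
    exps = {z: math.exp(v - shift) for z, v in scores.items()}
    total = sum(exps.values())
    return {z: e / total for z, e in exps.items()}

def sequence_log_prob(step_probs):
    """log p(z | w) as the sum of the per-step log-probabilities."""
    return sum(math.log(pr) for pr in step_probs)
```

Because the normalization runs only over the valid actions, an action that is unavailable in the current configuration receives no probability mass, regardless of its score.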


Why arc-standard? Arc-standard transitions parse a sentence from left to right, using a stack to store partially built syntactic structures and a buffer that keeps the incoming tokens to be parsed. The parsing algorithm chooses an action at each configuration by means of a score. In arc-standard parsing, the dependency tree is constructed bottom-up, because right-dependents of a head are only attached after the subtree under the dependent is fully parsed. Since our parser recursively computes representations of tree fragments, this construction order guarantees that once a syntactic structure has been used to modify a head, the algorithm will not try to find another head for the dependent structure. This means we can evaluate composed representations of tree fragments incrementally; we discuss our strategy for this below (§3.4).


3.2 Transition Operations 

Our parser is based on the arc-standard transition inventory (Nivre, 2004), given in Figure 3.
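For readers unfamiliar with arc-standard parsing, the symbolic side of the transitions (without any embeddings) can be sketched as follows. The attachment directions are assumed to follow the usual arc-standard convention, in which REDUCE-RIGHT makes the second stack element the head of the top element and REDUCE-LEFT does the reverse. The function names, the list-based stack and buffer, and the relation labels in the usage example are illustrative assumptions. The sketch does not enforce further constraints such as ROOT never becoming a dependent.

```python
def valid_actions(stack, buffer):
    """SHIFT needs a word in the buffer; a reduce needs two items on the stack."""
    actions = []
    if buffer:
        actions.append("SHIFT")
    if len(stack) >= 2:
        actions += ["REDUCE-LEFT", "REDUCE-RIGHT"]
    return actions

def apply_action(stack, buffer, arcs, action, relation=None):
    """Apply one transition in place; the tops of stack and buffer are list ends."""
    if action not in valid_actions(stack, buffer):
        raise ValueError("invalid action: " + action)
    if action == "SHIFT":
        stack.append(buffer.pop())
        return
    v = stack.pop()  # top of the stack
    u = stack.pop()  # second element from the top
    head, dep = (u, v) if action == "REDUCE-RIGHT" else (v, u)
    arcs.append((head, relation, dep))
    stack.append(head)  # the head stays on the stack with its new dependent

buffer = ["ROOT", "decision", "a"]  # first word on top, ROOT at the bottom
stack, arcs = [], []
for action, rel in [("SHIFT", None), ("SHIFT", None), ("REDUCE-LEFT", "det"),
                    ("SHIFT", None), ("REDUCE-LEFT", "root")]:
    apply_action(stack, buffer, arcs, action, rel)
print(arcs)  # [('decision', 'det', 'a'), ('ROOT', 'root', 'decision')]
```

A sentence of n words plus ROOT is processed with n + 1 shifts and n reductions, which is the linear operation count mentioned in the introduction.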


^5 In general, $\mathcal{A}(S, B)$ is the complete set of parser actions discussed in §3.2, but in some cases not all actions are available. For example, when S is empty and words remain in B, a SHIFT operation is obligatory (Sartorio et al., 2013).


3.3 Token Embeddings and OOVs 

To represent each input token, we concatenate three vectors: a learned vector representation for each word type ($\mathbf{w}$); a fixed vector representation from a neural language model ($\mathbf{w}_{\mathrm{LM}}$); and a learned representation ($\mathbf{t}$) of the POS tag of the token, provided as auxiliary input to the parser. A linear map ($\mathbf{V}$) is applied to the resulting vector and passed through a component-wise ReLU,

$$\mathbf{x} = \max\{\mathbf{0}, \mathbf{V}[\mathbf{w}; \mathbf{w}_{\mathrm{LM}}; \mathbf{t}] + \mathbf{b}\}.$$

This mapping can be shown schematically as in Figure 4.

Figure 3: Parser transitions indicating the action applied to the stack and buffer and the resulting stack and buffer states. Bold symbols indicate (learned) embeddings of words and relations, script symbols indicate the corresponding words and relations.

Figure 4: Token embedding of the words decision, which is present in both the parser’s training data and the language model data, and overhasty, an adjective that is not present in the parser’s training data but is present in the LM data.

This architecture lets us deal flexibly with out-of-vocabulary words, both those that are OOV in the very limited parsing data but present in the pretraining LM, and words that are OOV in both. To ensure we have estimates of the OOVs in the parsing training data, we stochastically replace (with p = 0.5) each singleton word type in the parsing training data with the UNK token in each training iteration.
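The token embedding and the stochastic UNK replacement can be sketched as follows. The function names, the dictionary of training counts, and the explicit random number generator argument are assumptions made for illustration; only the concatenate-then-ReLU mapping and the replacement probability of 0.5 for singleton word types come from the text.

```python
import random

def token_embedding(V, b, w, w_lm, t):
    """x = max{0, V[w; w_LM; t] + b}, with plain lists as vectors."""
    concat = w + w_lm + t
    return [max(0.0, sum(v * c for v, c in zip(row, concat)) + bi)
            for row, bi in zip(V, b)]

def replace_singletons(tokens, train_counts, rng, p=0.5, unk="UNK"):
    """Swap each word seen exactly once in training for UNK with probability p."""
    return [unk if train_counts.get(tok, 0) == 1 and rng.random() < p else tok
            for tok in tokens]

# Hypothetical counts: "overhasty" is a singleton in the parsing data.
counts = {"an": 10, "overhasty": 1, "decision": 7}
sentence = ["an", "overhasty", "decision"]
noisy = replace_singletons(sentence, counts, random.Random(0))
```

Calling `replace_singletons` afresh in every training iteration means a singleton is sometimes seen as itself and sometimes as UNK, so the model obtains a trained UNK representation without losing the singleton's own vector.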


Pretrained word embeddings. A veritable cottage industry exists for creating word embeddings, meaning numerous pretraining options for $\mathbf{w}_{\mathrm{LM}}$ are available. However, for syntax modeling problems, embedding approaches which discard order perform less well (Bansal et al., 2014); therefore we used a variant of the skip n-gram model introduced by Ling et al. (2015), named “structured skip n-gram,” where a different set of parameters is used to predict each context word depending on its position relative to the target word. The hyperparameters of the model are the same as in the skip n-gram model defined in word2vec (Mikolov et al., 2013), and we set the window size to 5, used a negative sampling rate of 10, and ran 5 epochs through unannotated corpora described in §5.1.

3.4 Composition Functions

Recursive neural network models enable complex phrases to be represented compositionally in terms of their parts and the relations that link them (Socher et al., 2011; Socher et al., 2013c; Hermann and Blunsom, 2013; Socher et al., 2013b). We follow this previous line of work in embedding dependency tree fragments that are present in the stack S in the same vector space as the token embeddings discussed above.

A particular challenge here is that a syntactic head may, in general, have an arbitrary number of dependents. To simplify the parameterization of our composition function, we combine head-modifier pairs one at a time, building up more complicated structures in the order they are “reduced” in the parser, as illustrated in Figure 5. Each node in this expanded syntactic tree has a value computed as a function of its three arguments: the syntactic head ($\mathbf{h}$), the dependent ($\mathbf{d}$), and the syntactic relation being satisfied ($\mathbf{r}$). We define this by concatenating the vector embeddings of the head, dependent and relation, applying a linear operator and a component-wise nonlinearity as follows:

$$\mathbf{c} = \tanh(\mathbf{U}[\mathbf{h}; \mathbf{d}; \mathbf{r}] + \mathbf{e}).$$

For the relation vector, we use an embedding of the parser action that was applied to construct the relation (i.e., the syntactic relation paired with the direction of attachment).
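The pairwise composition can be sketched as below. The names `compose` and `embed_subtree` are hypothetical, and the sketch assumes that the caller supplies the (dependent, relation) pairs in the order in which the parser reduced them.

```python
import math

def compose(U, e, head, dep, rel):
    """c = tanh(U[h; d; r] + e); U has as many rows as a token vector has entries."""
    x = head + dep + rel  # list concatenation plays the role of [h; d; r]
    return [math.tanh(sum(u * xi for u, xi in zip(row, x)) + ei)
            for row, ei in zip(U, e)]

def embed_subtree(U, e, head, dependents):
    """Fold (dependent, relation) pairs into the head, one pair at a time."""
    vec = head
    for dep, rel in dependents:
        vec = compose(U, e, vec, dep, rel)
    return vec
```

Since the output of `compose` has the same size as a token embedding, the composed vector can take the head's place on the stack and be composed again when the head acquires a further dependent.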

4 Training Procedure 

We trained our parser to maximize the conditional log-likelihood (Eq. 1) of treebank parses given sentences. Our implementation constructs a computation graph for each sentence and runs forward- and backpropagation to obtain the gradients of this objective with respect to the model parameters. The computations for a single parsing model were run on a single thread on a CPU. Using the dimensions discussed in the next section, we required between 8 and 12 hours to reach convergence on a held-out dev set.^6

Figure 5: The representation of a dependency subtree (above) is computed by recursively applying composition functions to (head, modifier, relation) triples. In the case of multiple dependents of a single head, the recursive branching order is imposed by the order of the parser’s reduce operations (below).

Parameter optimization was performed using stochastic gradient descent with an initial learning rate of $\eta_0 = 0.1$, and the learning rate was updated on each pass through the training data as $\eta_t = \eta_0/(1 + \rho t)$, with $\rho = 0.1$ and where $t$ is the number of epochs completed. No momentum was used. To mitigate the effects of “exploding” gradients, we clipped the $\ell_2$ norm of the gradient to 5 before applying the weight update rule (Sutskever et al., 2014; Graves, 2013). An $\ell_2$ penalty of $1 \times 10^{-6}$ was applied to all weights.

Matrix and vector parameters were initialized with uniform samples in $\pm\sqrt{6/(r + c)}$, where $r$ and $c$ were the number of rows and columns in the structure (Glorot and Bengio, 2010).
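The learning rate schedule, gradient clipping, and initialization described here are simple enough to sketch directly. The function names and the flat-list representation of a gradient are assumptions of this sketch; the constants are the ones stated in the text.

```python
import math
import random

def learning_rate(t, eta0=0.1, rho=0.1):
    """eta_t = eta_0 / (1 + rho * t), where t is the number of completed epochs."""
    return eta0 / (1.0 + rho * t)

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so that its l2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

def glorot_uniform(rows, cols, rng):
    """Uniform samples in +/- sqrt(6 / (rows + cols))."""
    limit = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-limit, limit) for _ in range(cols)]
            for _ in range(rows)]
```

With these settings the learning rate halves after ten epochs, since $\eta_{10} = 0.1/(1 + 0.1 \cdot 10) = 0.05$.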


Dimensionality. The full version of our parsing model sets dimensionalities as follows. LSTM hidden states are of size 100, and we use two layers of LSTMs for each stack. Embeddings of the parser actions used in the composition functions have 16 dimensions, and the output embedding size is 20 dimensions. Pretrained word embeddings have 100 dimensions (English) and 80 dimensions (Chinese), and the learned word embeddings have 32 dimensions. Part of speech embeddings have 12 dimensions.

^6 Software for replicating the experiments is available from https://github.com/clab/lstm-parser

These dimensions were chosen based on intuitively reasonable values (words should have higher dimensionality than parsing actions, POS tags, and relations; LSTM states should be relatively large), and it was confirmed on development data that they performed well.^7 Future work might more carefully optimize these parameters; our reported architecture strikes a balance between minimizing computational expense and finding solutions that work.


5 Experiments 

We applied our parsing model and several variations of it to two parsing tasks and report results below.


5.1 Data 


We used the same data setup as Chen and Manning (2014), namely an English and a Chinese parsing task. This baseline configuration was chosen since they likewise used a neural parameterization to predict actions in an arc-standard transition-based parser.


For English, we used the Stanford Dependency (SD) treebank (de Marneffe et al., 2006) used in Chen and Manning (2014), which is the closest model published, with the same splits.^8 The part-of-speech tags are predicted by using the Stanford Tagger (Toutanova et al., 2003) with an accuracy of 97.3%. This treebank contains a negligible amount of non-projective arcs (Chen and Manning, 2014).


For Chinese, we use the Penn Chinese Treebank 5.1 (CTB5) following Zhang and Clark (2008),^9 with gold part-of-speech tags, which is also the same as in Chen and Manning (2014).


Language model word embeddings were generated, for English, from the AFP portion of the English Gigaword corpus (version 5), and from the complete Chinese Gigaword corpus (version 2), as segmented by the Stanford Chinese Segmenter (Tseng et al., 2005).

^7 We did perform preliminary experiments with LSTM states of 32, 50, and 80, but the other dimensions were our initial guesses.

^8 Training: 02-21. Development: 22. Test: 23.

^9 Training: 001-815, 1001-1136. Development: 886-931, 1148-1151. Test: 816-885, 1137-1147.


5.2 Experimental configurations 

We report results on five experimental configurations per language, as well as the Chen and Manning (2014) baseline. These are: the full stack LSTM parsing model (S-LSTM), the stack LSTM parsing model without POS tags (-POS), the stack LSTM parsing model without pretrained language model embeddings (-pretraining), the stack LSTM parsing model that uses just head words on the stack instead of composed representations (-composition), and the full parsing model where rather than an LSTM, a classical recurrent neural network is used (S-RNN).


5.3 Results

Following Chen and Manning (2014) we exclude punctuation symbols for evaluation. Tables 1 and 2 show comparable results with Chen and Manning (2014), and we show that our model is better than their model in both the development set and the test set.

                  Development        Test
                  UAS     LAS     UAS     LAS
S-LSTM            93.2    90.9    93.1    90.9
-POS              93.1    90.4    92.7    90.3
-pretraining      92.7    90.4    92.4    90.0
-composition      92.7    89.9    92.2    89.6
S-RNN             92.8    90.4    92.3    90.1
C&M (2014)        92.2    89.7    91.8    89.6

Table 1: English parsing results (SD)

                  Dev. set           Test set
                  UAS     LAS     UAS     LAS
S-LSTM            87.2    85.9    87.2    85.7
-composition      85.8    84.0    85.3    83.6
-pretraining      86.3    84.7    85.7    84.1
-POS              82.8    79.8    82.2    79.1
S-RNN             86.3    84.7    86.1    84.6
C&M (2014)        84.0    82.4    83.9    82.4

Table 2: Chinese parsing results (CTB5)

5.4 Analysis

Overall, our parser substantially outperforms the baseline neural network parser of Chen and Manning (2014), both in the full configuration and in the various ablated conditions we report. The one exception to this is the -POS condition for the Chinese parsing task, in which we underperform their baseline (which used gold POS tags), although we do still obtain reasonable parsing performance in this limited case. We note that predicted POS tags in English add very little value, suggesting that we can think of parsing sentences directly without first tagging them. We also find that using composed representations of dependency tree fragments outperforms using representations of head words alone, which has implications for theories of headedness. Finally, we find that while LSTMs outperform baselines that use only classical RNNs, these are still quite capable of learning good representations.


Effect of beam size. Beam search was determined to have minimal impact on scores (absolute improvements of < 0.3% were possible with small beams). Therefore, all results we report used greedy decoding; Chen and Manning (2014) likewise only report results with greedy decoding. This finding is in line with previous work that generates sequences from recurrent networks (Grefenstette et al., 2014), although Vinyals et al. (2015) did report much more substantial improvements with beam search on their “grammar as a foreign language” parser.^10

6 Related Work 

Our approach ties together several strands of previous work. First, several kinds of stack memories have been proposed to augment neural architectures. Das et al. (1992) proposed a neural network with an external stack memory based on recurrent neural networks. In contrast to our model, in which the entire contents of the stack are summarized in a single value, in their model, the network could only see the contents of the top of the stack. Miikkulainen (1996) proposed an architecture with a stack that had a summary feature, although the stack control was learned as a latent variable.

A variety of authors have used neural networks to predict parser actions in shift-reduce parsers. The earliest attempt we are aware of is due to Mayberry and Miikkulainen (1999). The resurgence of interest in neural networks has resulted in several applications to transition-based dependency parsers (Weiss et al., 2015; Chen and Manning, 2014; Stenetorp, 2013). In these works, the conditioning structure was manually crafted and sensitive to only certain properties of the state, while we are conditioning on the global state object. Like us, Stenetorp (2013) used recursively composed representations of the tree fragments (a head and its dependents). Neural networks have also been used to learn representations for use in chart parsing (Henderson, 2004; Titov and Henderson, 2007; Socher et al., 2013a; Le and Zuidema, 2014).

^10 Although superficially similar to ours, Vinyals et al. (2015) is a phrase-structure parser and adaptation to the dependency parsing scenario would have been nontrivial. We discuss their work in §6.


LSTMs have also recently been demonstrated as a mechanism for learning to represent parse structure. Vinyals et al. (2015) proposed a phrase-structure parser based on LSTMs which operated by first reading the entire input sentence in so as to obtain a vector representation of it, and then generating bracketing structures sequentially conditioned on this representation. Although superficially similar to our model, their approach has a number of disadvantages. First, they relied on a large amount of semi-supervised training data that was generated by parsing a large unannotated corpus with an off-the-shelf parser. Second, while they recognized that a stack-like shift-reduce parser control provided useful information, they only made the top word of the stack visible during training and decoding. Third, although it is an impressive feat of learning that an entire parse tree be represented by a vector, it seems that this formulation makes the problem unnecessarily difficult.


Finally, our work can be understood as a progression toward using larger contexts in parsing. An exhaustive summary is beyond the scope of this paper, but some of the important milestones in this tradition are the use of cube pruning to efficiently include nonlocal features in discriminative chart reranking (Huang and Chiang, 2008), approximate decoding techniques based on LP relaxations in graph-based parsing to include higher-order features (Martins et al., 2010), and randomized hill-climbing methods that enable arbitrary nonlocal features in global discriminative parsing models (Zhang et al., 2014). Since our parser is sensitive to any part of the input, its history, or its stack contents, it is similar in spirit to the last approach, which permits truly arbitrary features.


7 Conclusion 


We presented stack LSTMs, recurrent neural networks for sequences, with push and pop operations, and used them to implement a state-of-the-art transition-based dependency parser. We conclude by remarking that stack memory offers intriguing possibilities for learning to solve general information processing problems (Miikkulainen, 1996). Here, we learned from observable stack manipulation operations (i.e., supervision from a treebank), and the computed embeddings of final parser states were not used for any further prediction. However, this could be reversed, giving a device that learns to construct context-free programs (e.g., expression trees) given only observed outputs; one application would be unsupervised parsing. Such an extension of the work would make it an alternative to architectures that have an explicit external memory such as neural Turing machines (Graves et al., 2014) and memory networks (Weston et al., 2015). However, as with those models, without supervision of the stack operations, formidable computational challenges must be solved (e.g., marginalizing over all latent stack operations), but sampling techniques and techniques from reinforcement learning have promise here (Zaremba and Sutskever, 2015), making this an intriguing avenue for future work.


Acknowledgments 

The authors would like to thank Lingpeng Kong and Jacob Eisenstein for comments on an earlier version of this draft and Danqi Chen for assistance with the parsing datasets. This work was sponsored in part by the U. S. Army Research Laboratory and the U. S. Army Research Office under contract/grant number W911NF-10-1-0533, and in part by NSF CAREER grant IIS-1054319. Miguel Ballesteros is supported by the European Commission under the contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA).


References 

[Ballesteros and Bohnet2014] Miguel Ballesteros and 
Bernd Bohnet. 2014. Automatic feature selec¬ 
tion for agenda-based dependency parsing. In Proc. 
COLING. 

[Ballesteros and Nivre2014] Miguel Ballesteros and 
Joakim Nivre. 2014. MaltOptimizer; Fast and 



















































effective parser optimization. Natural Language 
Engineering. 

[Bansal et al.2014] Mohit Bansal, Kevin Gimpel, and 
Karen Livescu. 2014. Tailoring continuous word 
representations for dependency parsing. In Proc. 
ACL. 

[Bohnet and Nivre2012] Bernd Bohnet and Joakim 
Nivre. 2012. A transition-based system for joint 
part-of-speech tagging and labeled non-projective 
dependency parsing. In Proc. EMNLP. 

[Chen and Manning2014] Danqi Chen and
Christopher D. Manning. 2014. A fast and accurate
dependency parser using neural networks. In Proc.
EMNLP.

[Chen et al.2014] Wenliang Chen, Yue Zhang, and Min 
Zhang. 2014. Feature embedding for dependency 
parsing. In Proc. COLING. 

[Choi and McCallum2013] Jinho D. Choi and Andrew 
McCallum. 2013. Transition-based dependency 
parsing with selectional branching. In Proc. ACL. 

[Das et al.1992] Sreerupa Das, C. Lee Giles, and
Guo-Zheng Sun. 1992. Learning context-free grammars:
Capabilities and limitations of a recurrent neural
network with an external stack memory. In Proc.
Cognitive Science Society.

[de Marneffe et al.2006] Marie-Catherine de Marneffe,
Bill MacCartney, and Christopher D. Manning.
2006. Generating typed dependency parses from
phrase structure parses. In Proc. LREC.

[Glorot and Bengio2010] Xavier Glorot and Yoshua
Bengio. 2010. Understanding the difficulty of
training deep feedforward neural networks. In Proc.
ICML.

[Glorot et al.2011] Xavier Glorot, Antoine Bordes, and 
Yoshua Bengio. 2011. Deep sparse rectifier neural 
networks. In Proc. AISTATS. 

[Goldberg et al.2013] Yoav Goldberg, Kai Zhao, and 
Liang Huang. 2013. Efficient implementation of 
beam-search incremental parsers. In Proc. ACL. 

[Graves and Schmidhuber2005] Alex Graves and
Jürgen Schmidhuber. 2005. Framewise phoneme
classification with bidirectional LSTM networks. In
Proc. IJCNN.

[Graves et al.2007] Alex Graves, Santiago Fernández,
and Jürgen Schmidhuber. 2007. Multi-dimensional
recurrent neural networks. In Proc. ICANN.

[Graves et al.2014] Alex Graves, Greg Wayne, and Ivo
Danihelka. 2014. Neural Turing machines. CoRR,
abs/1410.5401.

[Graves2013] Alex Graves. 2013. Generating
sequences with recurrent neural networks. CoRR,
abs/1308.0850.


[Grefenstette et al.2014] Edward Grefenstette,
Karl Moritz Hermann, Georgiana Dinu, and
Phil Blunsom. 2014. New directions in vector
space models of meaning. ACL Tutorial.

[Henderson2004] James Henderson. 2004.
Discriminative training of a neural network discriminative
parser. In Proc. ACL.

[Hermann and Blunsom2013] Karl Moritz Hermann 
and Phil Blunsom. 2013. The role of syntax in 
vector space models of compositional semantics. In 
Proc. ACL. 

[Hochreiter and Schmidhuber1997] Sepp Hochreiter
and Jürgen Schmidhuber. 1997. Long short-term
memory. Neural Computation, 9(8):1735-1780.

[Huang and Chiang2008] Liang Huang and David
Chiang. 2008. Forest reranking: Discriminative parsing
with non-local features. In Proc. ACL.

[Le and Zuidema2014] Phong Le and Willem Zuidema. 
2014. Inside-outside recursive neural network 
model for dependency parsing. In Proc. EMNLP. 

[Ling et al.2015] Wang Ling, Chris Dyer, Alan Black,
and Isabel Trancoso. 2015. Two/too simple
adaptations of word2vec for syntax problems. In Proc.
NAACL.

[Martins et al.2010] André F. T. Martins, Noah A.
Smith, Eric P. Xing, Pedro M. Q. Aguiar, and Mário
A. T. Figueiredo. 2010. Turboparsers: Dependency
parsing by approximate variational inference. In
Proc. EMNLP.

[Mayberry and Miikkulainen1999] Marshall R.
Mayberry and Risto Miikkulainen. 1999. SARDSRN: A
neural network shift-reduce parser. In Proc. IJCAI.

[Miikkulainen1996] Risto Miikkulainen. 1996.
Subsymbolic case-role analysis of sentences with
embedded clauses. Cognitive Science, 20:47-73.

[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, 
Kai Chen, Greg S Corrado, and Jeff Dean. 2013. 
Distributed representations of words and phrases 
and their compositionality. In Proc. NIPS. 

[Nivre2003] Joakim Nivre. 2003. An efficient
algorithm for projective dependency parsing. In Proc.
IWPT.

[Nivre2004] Joakim Nivre. 2004. Incrementality in
deterministic dependency parsing. In Proceedings of
the Workshop on Incremental Parsing: Bringing
Engineering and Cognition Together.

[Nivre2007] Joakim Nivre. 2007. Incremental
non-projective dependency parsing. In Proc. NAACL.

[Nivre2008] Joakim Nivre. 2008. Algorithms for
deterministic incremental dependency parsing.
Computational Linguistics, 34(4):513-553. MIT Press.



[Nivre2009] Joakim Nivre. 2009. Non-projective
dependency parsing in expected linear time. In Proc.
ACL.

[Pascanu et al.2014] Razvan Pascanu, Çağlar Gülçehre,
Kyunghyun Cho, and Yoshua Bengio. 2014. How to
construct deep recurrent neural networks. In Proc.
ICLR.

[Sartorio et al.2013] Francesco Sartorio, Giorgio Satta,
and Joakim Nivre. 2013. A transition-based
dependency parser using a dynamic parsing strategy. In
Proc. ACL.

[Socher et al.2011] Richard Socher, Eric H. Huang,
Jeffrey Pennington, Andrew Y. Ng, and Christopher D.
Manning. 2011. Dynamic pooling and unfolding
recursive autoencoders for paraphrase detection. In
Proc. NIPS.

[Socher et al.2013a] Richard Socher, John Bauer,
Christopher D. Manning, and Andrew Y. Ng.
2013a. Parsing with compositional vector
grammars. In Proc. ACL.

[Socher et al.2013b] Richard Socher, Andrej Karpathy,
Quoc V. Le, Christopher D. Manning, and
Andrew Y. Ng. 2013b. Grounded compositional
semantics for finding and describing images with
sentences. TACL.

[Socher et al.2013c] Richard Socher, Alex Perelygin,
Jean Y. Wu, Jason Chuang, Christopher D. Manning,
Andrew Y. Ng, and Christopher Potts. 2013c.
Recursive deep models for semantic compositionality
over a sentiment treebank. In Proc. EMNLP.

[Stenetorp2013] Pontus Stenetorp. 2013.
Transition-based dependency parsing using recursive neural
networks. In Proc. NIPS Deep Learning Workshop.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals,
and Quoc V. Le. 2014. Sequence to sequence
learning with neural networks. In Proc. NIPS.

[Titov and Henderson2007] Ivan Titov and James
Henderson. 2007. Constituent parsing with incremental
sigmoid belief networks. In Proc. ACL.

[Toutanova et al.2003] Kristina Toutanova, Dan Klein,
Christopher D. Manning, and Yoram Singer. 2003.
Feature-rich part-of-speech tagging with a cyclic
dependency network. In Proc. NAACL.

[Tseng et al.2005] Huihsin Tseng, Pichuan Chang, 
Galen Andrew, Daniel Jurafsky, and Christopher 
Manning. 2005. A conditional random field word 
segmenter for SIGHAN bakeoff 2005. In Proc. 
Fourth SIGHAN Workshop on Chinese Language 
Processing. 

[Vinyals et al.2015] Oriol Vinyals, Łukasz Kaiser,
Terry Koo, Slav Petrov, Ilya Sutskever, and
Geoffrey Hinton. 2015. Grammar as a foreign language.
In Proc. ICLR.


[Weiss et al.2015] David Weiss, Christopher Alberti, 
Michael Collins, and Slav Petrov. 2015. Structured 
training for neural network transition-based parsing. 
In Proc. ACL. 

[Weston et al.2015] Jason Weston, Sumit Chopra, and 
Antoine Bordes. 2015. Memory networks. In Proc. 
ICLR. 

[Yamada and Matsumoto2003] Hiroyasu Yamada and
Yuji Matsumoto. 2003. Statistical dependency
analysis with support vector machines. In Proc. IWPT.

[Zaremba and Sutskever2015] Wojciech Zaremba and
Ilya Sutskever. 2015. Reinforcement learning
neural Turing machines. ArXiv e-prints, May.

[Zhang and Clark2008] Yue Zhang and Stephen Clark.
2008. A tale of two parsers: Investigating and
combining graph-based and transition-based dependency
parsing. In Proc. EMNLP.

[Zhang and Nivre2011] Yue Zhang and Joakim Nivre. 
2011. Transition-based dependency parsing with 
rich non-local features. In Proc. ACL. 

[Zhang et al.2014] Yuan Zhang, Tao Lei, Regina
Barzilay, and Tommi Jaakkola. 2014. Greed is good if
randomized: New inference for dependency parsing.
In Proc. EMNLP.



